Research in Continuous Speech Recognition
Abstract
The primary goal of this work is to develop improved methods and models for acoustic recognition of continuous speech. Most of the work has focused on deriving statistical models that capture the acoustic-phonetic phenomena occurring in speech, subject to the constraint that the models can be adequately estimated from a reasonable amount of training speech. Most of our work in phonetic recognition and word recognition over the past six years has involved hidden Markov models (HMMs). One significant contribution in this area has been a technique we developed for modeling the effects of phonetic coarticulation in a robust way. The technique is based on estimating context-dependent models of each of the phonemes. We have shown that this basic model (and its extensions) yields a significant improvement in word and sentence recognition accuracy. We have also developed the "stochastic segment model", which can model the correlation between different parts of the phoneme directly. In initial experiments with this model on context-independent phonetic units, the recognition error was half that of the corresponding context-independent HMMs. However, the new method requires significantly more computation. In general, collecting a large amount of speech (about 30 minutes) from the particular speaker who will use the system results in the highest recognition accuracy. However, one may need to minimize the amount of training speech from a new speaker without losing performance. We have developed a "probabilistic spectral mapping" technique for adapting a model from one speaker to a new speaker based on a small amount of speech. Using this technique, the recognition accuracy with only 2 minutes of training from the new speaker is equal to that usually achieved using 20 minutes of speaker-dependent training.
In this project we have combined various knowledge sources to produce the BYBLOS speech recognition system. The system was tested on the DARPA Resource Management Database under several grammar conditions and achieved higher recognition accuracies than had previously been reported for tasks of this complexity. In the area of real-time speech recognition we have pursued two activities: the implementation of our speech recognition algorithms on a general-purpose parallel processor, and a joint effort with UC Berkeley and SRI to design and build a special-purpose board set capable of real-time continuous speech recognition with a large vocabulary (3000 words) and a statistical language model. In this latter activity, we provided the BYBLOS recognition code and consulted on the changes that would be appropriate for a special-purpose VLSI implementation. The first prototype of this board set is expected to be completed by mid-1989. The parallel implementation research used the BBN Butterfly™ parallel processor. Achieving near-linear parallel efficiency on large configurations (97 processors) required some tuning of the communication and synchronization procedures. The final result was a factor of 79 increase in speed on the 97-processor machine over the speed on a single processor. The near-real-time BYBLOS speech recognition system on a 32-processor Butterfly™ was demonstrated and used in several "live" tests.
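As a rough check on the "near-linear" claim, the reported speedup can be turned into a parallel efficiency figure. The sketch below assumes the usual definition, efficiency = speedup / number of processors; the function name `parallel_efficiency` is illustrative and does not come from the original work.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Return speedup divided by processor count (1.0 means perfectly linear scaling)."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Figures reported in the abstract: a 79x speedup on a 97-processor machine.
efficiency = parallel_efficiency(79.0, 97)
print(f"parallel efficiency: {efficiency:.1%}")  # prints "parallel efficiency: 81.4%"
```

Under this definition, the 97-processor configuration runs at roughly 81% of ideal linear scaling.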
Similar Resources
Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation
A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
Effects of ageing on speed and temporal resolution of speech stimuli in older adults
Background: According to previous studies, most speech recognition disorders in older adults result from deficits in audibility and auditory temporal resolution. In this paper, the effect of ageing on time-compressed speech and auditory temporal resolution was studied by word recognition in continuous and interrupted noise. Methods: A time-compressed speech test (TCST) w...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieving from this archive, is one of the key issues for this broadcasting...
Speech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...